Final Project: A Model to Estimate HR Scores (2020-)
1 Introduction
1.1 Prediction Problem
Two of my three dissertation chapters rely on the use of a dependent variable measuring human rights respect. Hitherto, I have depended on Fariss’s Human Rights (HR) Scores,1 but he has not updated his dataset since 2020.2 This poses a problem insofar as it excludes the most recent country-year observations from my universe of cases entirely.3 In my previous final project, and by way of exploratory data analyses (EDAs), I identified a “raw” form of the physical violence index (PVI)—a simple average of the freedoms from political killings and torture indicators featured in the Varieties of Democracy (V-Dem) dataset—as a potential substitute for HR scores. However, for this project, I aim to predict outright the missing data in HR scores whilst more formally appraising the suitability of the raw PVI as a simple substitute for HR scores.
1 For the original article wherein these scores were introduced, see Fariss, 2014.
2 See his Dataverse page for evidence.
3 That is, it is impossible to evaluate the relationship(s) between independent and dependent variables for cases where the latter is systematically missing.
1.2 Key Findings
After implementing and assessing a diversity of methods to predict HR scores, as summarized in this report, I render the following conclusions:
- There exist more accurate ways to approximate HR scores than to use the raw PVI in lieu of it (see Section 5).
- Of these, the best-performing is a KNN-regression model (see Section 6).
- Time-series models exhibit even lower mean prediction errors, yet evidence suggests they may feature greater variance vis-à-vis the KNN model (see Section 8).
1.3 Overview of Report
The remainder of my report is organized as follows:
- An overview of HR scores and V-Dem, including an EDA of the former (Section 2).
- A description of initial data preprocessing (Section 3).
- A summary of the recipes used (Section 4).
- The results of the baseline fits (Section 5).
- The results of the tuned fits (Section 6).
- The final fit’s predictions and an analysis thereof (Section 7).
- A discussion of time-series predictions and how they compare to those of the final fit (Section 8)
- Appendices (Section 9)
2 Data Overview
2.1 HR Scores
The HR Scores dataset, which features the HR scores variable,4 consists of 11275 country-year observations spanning the period 1946-2019. The HR scores variable lacks missingness, with every observation possessing an HR score.
4 This variable is denominated theta_mean in the original dataset. For the remainder of the report, I use “HR scores” to refer to this specific variable, not the dataset from which it originates.
2.1.1 EDA of HR Scores
The distribution of HR scores may be visualized as follows:
Figure 1 shows that the variable isn’t distributed in a perfectly symmetrical manner, exhibiting a slight right-skew and a degree of bimodality. Nevertheless, these features seem relatively minor, and most methods available to correct them do not apply given the nature of the variable.5 I also attempted a Yeo-Johnson transformation of HR scores, but it failed to adequately rectify the problems of skewness and bimodality.6 In view of these considerations, I proceed by keeping HR scores unaltered.
5 Namely, square-root, log, and Box-Cox transformations only work for positive variables, but approximately half of HR Scores’ values are negative.
6 For more, see Figure 4
2.2 V-Dem
The 2024 version of V-Dem (V14) fully consists of 4607 variables and 27734 country-year observations spanning the period 1789-2023. 12185 of these observations cover the years 1946-2019, whereas 716 of them cover 2020-2023. Overall, the dataset features an appreciable degree of missingness, with approximately 26.36% of values in the 1946-2019 period being absent.7 However, rates of missingness are vastly lower for the five “high-level” democracy indices (approximately 0.30%) and the twenty-one “mid-level” components of said indices (approximately 0.59% ).8 These missingness rates are reduced even further following pre-recipe preprocessing (see Section 3).
7 This figure is calculated from all non-ID variables. See scripts/1b_data_quality_check.R for the computation.
8 See scripts/1b_data_quality_check.R again for the relevant computations.
3 Pre-Recipe Preprocessing
Prior to writing my recipes, I complete the following noteworthy preprocessing steps:
- Merge V-Dem and HR Scores.9
- For V-Dem, recode Czechoslovakia’s pre-Velvet Divorce (1992) observations as those of the Czech Republic.10 Doing so rectifies widespread missingness in this case. I deemed this justifiable on two counts:
- Fariss assigns the Czech Republic COW code to Czechoslovakia.
- V-Dem denominates both Czechoslovakia and the Czech Republic “Czechia”; there is no overlap in naming conventions or otherwise with Slovakia, which doesn’t appear in the dataset until 1993.
- Given the wholesale absence of predictors and thus the virtual impossibility of prediction-making, remove country-years that appear in HR scores but not in V-Dem. These cases appear in Table 1, below, where the value on “N Missing” is positive:
9 My actual code also merges these datasets with the Political Terror Scale (PTS) dataset. At the outset, I desired to include variables from the latter in my models, as they were explicitly noted as inputs for the original HR Scores; however, they featured widespread missingness, to the extent that imputing their values proved to be computationally infeasible.
10 I.e., recode cowcode == 315 (Czechoslovakia) to cowcode == 316 (Czech Republic).
- These observations mainly involve:
- Post-Soviet states (e.g., Armenia, Azerbaijan, etc.) and post-WWII states (e.g., West and East Germany).
- European microstates (e.g., Liechtenstein and Monaco) and small island states (e.g., Samoa, St. Vincent and the Grenadines, etc.).
- The former is more tolerable, for they stem from minor discrepancies in coding start/end dates and only pertain to a handful of years. The latter invites more reason for pause; indeed, to delete these observations is effectively to remove the world’s smallest countries from the ambit of my research entirely. Nevertheless, it doesn’t make sense, in my estimation, to retain them when there exist no data in V-Dem whatsoever—including the most basic of figures, such as population, GDP, etc.—on which even somewhat-credible imputations might be based.
- Select as predictors the variables I identified in my previous final project as being related in theory to HR scores, as well as the entirety of V-Dem’s high-level and medium-level indices. I do so primarily to minimize my computational workload, but also under the intuition that including variables beyond this set would yield diminishing returns with respect to accuracy.11
11 That is, including more variables would bring fewer improvements to predictive accuracy at the expense of computational speed.
12 My saved preprocessed dataset actually contains 52 variables, but three of these are PTS metrics that are removed at the beginning of recipe preprocessing.
13 Compare with Figure 1. Admittedly, the distribution might be marginally more right-skewed; this likely owes itself to the removal of the European microstates, whose human rights levels are generally very high.
Upon considering and completing these steps, I move forward with a preprocessed dataset of 49 variables and 10524 observations.12 The plot below demonstrates that the preprocessing steps involving HR scores does not significantly alter the distribution thereof:13
The preprocessed dataset also exhibits an exceedingly low degree of missingness relative to that of V-Dem overall (see Section 2), with only about about 0.19% of values from non-ID variables being missing. Figure 3, below, summarizes this missingness by predictor:
Ultimately, in virtue of the minimal extent of missingness, imputations are unlikely to prove computationally demanding or to appreciably qualify the credibility of predictions generated therefrom.
3.1 Data Splitting Specifications
Upon completion of these preprocessing steps, I split the data with proportions of 75-25 training/testing, 90-10 analysis/assessment. The cross-validation split is repeated five times, resulting in a total of 50 folds.
4 Recipes
My null (null) and basic baseline (lm) models (see Section 5) draw on three recipes, each distinguished by the number of neighbors set for KNN-imputation: 5, 10, and 20, respectively.14 I establish and test these recipe variants, at this stage, to appraise the extent to which changes in the number of neighbors set for imputation effectuate changes in model performance.
14 I considered bagged-tree imputation as well, but it proved computationally infeasible.
Aside from KNN-imputation, the main preprocessing components shared by each recipe-group are as follows:
- Transforming country ID (
cowcode) and year to a dummy variable. - Log-transforming population, GDP, and GDP-per-capita.15
- Normalizing all numeric predictors.
- Removing the year dummies subsequent to KNN-imputation.
15 These variables are widely-known as right-skewed, and they are indeed so in the V-Dem dataset. For evidence that I have verified this, see scripts/appendices/0_appendices.R.
16 The 2024 V-Dem Dataset, which will provide data up to 2023, is scheduled to be released on 07 March 2024.
17 Put differently, because year is a factor, there is no way to estimate the relationship between “2023,” “2024,” etc. and HR Scores when there exists wholesale missingness of HR Scores for those years.
This final step is important. As a factor, year can continue to be used as an imputation predictor for V-Dem observations, because there will almost certainly be V-Dem data for future years;16 yet it cannot be used to predict HR Scores, for that variable ends in 2019.17
In aggregate, the preprocessing steps results in a training set of 226 variables: 1 outcome, 2 ID variables,18 and 223 predictors—179 of which are country dummies.
18 country_name and cow_year
Also included as a baseline in Section 5 is a fourth recipe (akt_lm), which simply avails itself of my unscaled Political Violence Index (PVI), created in my previous final project, as the sole predictor of HR Scores. I do so in order to test the appropriateness of merely substituting it for HR Scores.
The ridge, lasso, and elastic-net models seen in Section 6 use the same recipe as that underpinning the 5-neighbor baselines.19 The KNN, random forest, and boosted tree models—also appearing in Section 6—rely on a similar recipe, the only difference being the inclusion of one-hot encoding for the country-ID dummy variables.20
19 See Section 5, below, for a discussion as to why I proceed with the 5-neighbor recipe for the tuning fits.
20 This means that the total number of predictors utilized by these models is 224—one more than that of the others.
To summarize, the recipes I deploy and their apportionment to the models can be organized as follows:
- Main:
- 5 Neighbors: basic baseline, null, ridge, lasso, elastic net
- 10 Neighbors: basic baseline, null
- 20 Neighbors: basic baseline, null
- Tree:
- 5 Neighbors: KNN, random forest, boosted tree
- Unscaled PVI
These amount to five recipes in total: three “main,” one “tree,” and one additional baseline (unscaled PVI).
5 Baseline Fits
Table 2, below, gives the performance of my baseline fits on the cross-validation folds as measured by mean RMSE:
| Workflow ID | Mean RMSE | Std. Error |
|---|---|---|
| neighbors_5_lm | 0.5651164 | 0.0023461 |
| neighbors_10_lm | 0.5653670 | 0.0023460 |
| neighbors_20_lm | 0.5657787 | 0.0023566 |
| akt_lm | 1.0097946 | 0.0042792 |
| neighbors_5_null | 1.4652267 | 0.0045097 |
| neighbors_10_null | 1.4652267 | 0.0045097 |
| neighbors_20_null | 1.4652267 | 0.0045097 |
As we can see, the null models perform well-worse than the basic baseline models; the unscaled PVI baseline splits the difference between the two.
There are two important takeaways at this juncture, in my view. First, the basic baseline outperforms the unscaled PVI baseline, meaning we can proceed knowing that there exist better ways to estimate HR Scores than to use the unscaled PVI as a simple substitute for HR Scores. Second, changes in the number of neighbors used for KNN-imputation seem to have little effect on model performance. This is good insofar as it is no longer a concern of ours; we will proceed with the step_impute_knn() default of neighbors = 5 for all remaining workflows, though we theoretically could have selected a larger (or smaller) number with marginal impact on our findings.
6 Tuned Fits
I subsequently tune the following hyperparameters for six models on the cross-validation folds:
- Ridge:
penalty - Lasso:
penalty - Elastic Net:
penaltyandmixture - KNN:
neighbors - Random Forest:
mtryandmin_n - Boosted Tree:
mtry,min_n, andlearn_rate
For the creation of the respective tuning grids, the first four use levels = 10, whereas the latter two use levels = 5. The random forest model further uses trees = 1000 and an mtry range set to c(1, 15), whereas the boosted tree model uses the same mtry range and a learn_rate range set to c(-3, -0.2).
The processing times for these models, with parallel processing across eight cores where possible, were approximately as follows:21
21 I simply timed these with my computer’s clock.
- Ridge: 5 minutes
- Lasso: 4 minutes
- Elastic Net: 9 minutes
- KNN: 12 minutes
- Random Forest: 92 minutes
- Boosted Tree: 65 minutes
Below is Table 3, which gives the best mean RMSE of each tuned fit:
| Model | Mean RMSE | Std. Error |
|---|---|---|
| knn | 0.2429732 | 0.0028678 |
| rf | 0.2869661 | 0.0018060 |
| bt | 0.4767500 | 0.0025412 |
| en | 0.5587930 | 0.0022834 |
| lasso | 0.5589801 | 0.0022842 |
| ridge | 0.5813393 | 0.0022810 |
The optimal tuning parameters for these models as assessed by mean RMSE are, respectively:
- KNN:
neighbors = 2 - Random Forest:
mtry = 14,min_n = 2 - Boosted Tree:
mtry = 14,min_n = 2,learn_rate = 0.631 - Elastic Net:
penalty = 1e-10,mixture = 0.683 - Lasso:
penalty = 1e-10 - Ridge:
penalty = 1e-10
Below is also a plot depicting the confidence intervals for each estimate:
| Model | Mean RMSE | Std. Error |
|---|---|---|
| knn | 0.2405136 | 0.0030602 |
| rf | 0.3061221 | 0.0016808 |
| bt | 0.5048741 | 0.0025192 |
| en | 0.5734804 | 0.0024225 |
| lasso | 0.5734929 | 0.0024186 |
| ridge | 0.5882699 | 0.0022959 |
Table 3 and the above plot evince that, while the best random-forest model possesses the lowest standard error, it is the best KNN model that has the lowest mean RMSE and confidence interval thereof. As such, we proceed with our final fit by selecting the best KNN model.
7 Final Fit
Table 5, below, gives the performance metrics for the final fit applied to the testing set:
| Metric | Estimate |
|---|---|
| rmse | 0.2305683 |
| mae | 0.1225338 |
| mape | 59.8248677 |
| rsq | 0.9741917 |
The RMSE is even lower when compared to that of the cross-validation folds (Table 4). The MAE is about half has large as the RMSE; and at approximately 0.975, the \(R^{2}\) statistic is exceedingly high. (The MAPE is about 51.2%, but this is a small figure in absolute terms, the preponderance of HR scores being clustered around 0.)
The RMSE of approximately 0.227 represents an exceedingly marginal difference. For the full set of HR Scores, the minimum is approximately -3.46, while the maximum is approximately 5.34, for a range of approximately 8.8; the RMSE’s range,22 then, represents only about 5.17% of the full range. Recalling that HR Scores are themselves estimates with a mean standard deviation of approximately 0.33 lends further credibility to the accuracy of our model.23 Qualitatively, a score difference of 0.227 does not seem to mean much either. Indeed, the examples of countries that experienced such a score change did not witness, to my knowledge, any appreciable changes in their underlying political milieus.24
22 Computed (approximately) by \(0.227*2\).
23 For the code computing this figure, see scripts/appendices/0_appendices.R.
24 For these examples, see scripts/appendices/0_appendices.R.
To round off this discussion is the below scatterplot, which depicts the relationship between our predictions and actual values in the testing set, illustrating the tight fit—and hence high degree of accuracy—of our final model:
8 Time-Series Predictions
8.1 Remaining Considerations
At this stage, I feel as though my final project is near to completion. Aside from the issues I’ve already identified as meriting potential feedback—namely, Yeo-Johnson transformation of the outcome (Section 2) and the removal of certain countries from the dataset (Section 3)—I welcome advice as to how to think about, or tackle if necessary, the issue of multicollinearity.
Needless to say, four of my five recipes are effectively “kitchen sink” in kind (the outlier being the unscaled PVI recipe). The V-Dem variables capture interrelated phenomena (e.g., elections and civil liberties) and occasionally comprise additive measures as well as their inputs (e.g., PVI and its components, the political killings and torture metrics). They also contain variants of the same variable set to different scales (e.g., the three-point, four-point, and five-point versions of the PVI). A particularly concerning example of this, to me, is the inclusion of both the unscaled PVI and original (unscaled) PVI, the latter simply being the former set to a 0-to-1 scale. The two are hence tightly related, namely through a logistic function, as demonstrated in the plot below:25
25 My work in this section can be found in scripts/appendices/.
I opted to include all such variants, for I didn’t want to presuppose that one would be “better” in predicting HR Scores than the other(s). I’m also of the understanding that multicollinearity may be less of a concern for prediction problems such as ours, where the primary objective is simply to produce the most accurate estimates possible—in contrast to inference problems, where robust coefficients and p-values are the gold standard. Nonetheless, I remain worried about the impact of multicollinearity on the strength of my predictions.
To this end, I created and briefly tested two sets of new recipes from my original set of “main” recipes (Section 4), both of which are multicollinearity averse. The first is less averse, simply removing the scaled PVI26 as a predictor from each main recipe.27 The second is more averse, removing not only said variable, but also all other variable-variants as predictors.28
26 v2x_clphy
27 I do, however, allow the scaled PVI to be present for the KNN-imputation step. This is also true for all other variable-variants in the more-averse set of recipes.
28 These variables are, namely, the PVI Ordinal (e_v2x_clphy *_3C, *_4C, *_5C), the Equality before the Law and Individual Liberty Index Ordinal (e_v2xcl_rol *_3C, *_4C, *_5C), and the Civil Liberties Index Ordinal (e_v2x_civlib *_3C, *_4C, *_5C).
For now, I tested these recipes on the baseline models, exclusively. Therefore, the recipes I deployed and their apportionment to the models can be organized as follows:
- Least Averse:
- 5 Neighbors: basic baseline, null
- 10 Neighbors: basic baseline, null
- 20 Neighbors: basic baseline, null
- Most Averse:
- 5 Neighbors: basic baseline, null
- 10 Neighbors: basic baseline, null
- 20 Neighbors: basic baseline, null
These amount to six recipes in total: three that are less multicollinearity averse, and three that are more multicollinearity averse.
The performance of these fits on the cross-validation folds, as measured by mean RMSE, are given in Table 6 and Table 7, below:
| Workflow ID | Mean RMSE | Std. Error |
|---|---|---|
| neighbors_5_lm | 0.5667459 | 0.0023647 |
| neighbors_10_lm | 0.5670072 | 0.0023641 |
| neighbors_20_lm | 0.5674522 | 0.0023732 |
| neighbors_5_null | 1.4652267 | 0.0045097 |
| neighbors_10_null | 1.4652267 | 0.0045097 |
| neighbors_20_null | 1.4652267 | 0.0045097 |
| Workflow ID | Mean RMSE | Std. Error |
|---|---|---|
| neighbors_5_lm | 0.5728643 | 0.0024738 |
| neighbors_10_lm | 0.5731172 | 0.0024712 |
| neighbors_20_lm | 0.5735224 | 0.0024798 |
| neighbors_5_null | 1.4652267 | 0.0045097 |
| neighbors_10_null | 1.4652267 | 0.0045097 |
| neighbors_20_null | 1.4652267 | 0.0045097 |
The basic baseline (*_lm) results seemingly signify a deterioration in quality: vis-à-vis the metrics seen in Table 2, the models perform worse, with gradual yet consistent increases in both mean RMSE and standard error as they become more multicollinearity averse.
If predictive accuracy is the chief objective, then these results suggest that I should not, in fact, proceed with the multicollinearity-averse recipes—irrespective of whether I ultimately use the values produced from my model for inferential tasks. Nevertheless, I am open to feedback that would confirm or challenge my interpretation of or decision-making around this topic.
9 Appendices
9.1 Distribution of Yeo-Johnson-transformed HR scores
The transformation does rectify the right-skew seen in Figure 1, if at the expense of introducing a slight left-skew. It also fails to eliminate the bimodality feature.